ABSTRACT


This paper proposes an alternative document representation: the topic
probabilities of each word are treated as a vector representation of
that word, and a document is represented as a combination of its word
vectors. A comparison on a corpus of summaries shows that this
representation is more effective for document clustering.
1. INTRODUCTION


Topic modeling is one of the most important methods in natural
language analysis. It discovers the underlying topics in a
collection of documents. The discovered topics are used to form topic
features for documents, which then serve as input to tasks such as
document clustering [11], automated summarization [1], and automated
essay grading [6]. LDA (Latent Dirichlet Allocation) [2, 3] is the
most popular method for topic modeling. An LDA topic model provides
topic proportions as a vector representation of a document. We
investigated an alternative document representation obtained by
summing up word probabilities from the LDA topic model. The new
representation is compared with the topic proportion representation
as input to a document clustering task on a summarization data set.
The results show that the simple “probability sum” document
representation performs better.


2. LDA and Document Representations 

Latent Dirichlet allocation (LDA), first introduced by Blei, Ng and
Jordan in 2003 [3], is one of the most popular methods in topic
modeling. LDA represents topics by word probabilities. Given a
vocabulary with N words, $\{w_1, w_2, \ldots, w_N\}$, the LDA model
probabilities $P_k = (p_k(w_1), p_k(w_2), \ldots, p_k(w_N))$ form a
representation of the k-th topic ($k = 1, 2, \ldots, K$). The words
with the highest probabilities in each topic usually give a good idea
about what the topic is.


In LDA, a document d has an inferred topic proportion which is
usually used as topic features to represent the document:


$$T(d) = (t_1(d), t_2(d), \ldots, t_K(d)).$$


From a statistical point of view, the topic proportion is probably the
only choice for an LDA-based document representation. However, if
we step outside the statistical framework, we can simply view the
probabilities of a word across the K topics as a K-dimensional vector
representation of that word. A document can then be represented
by summing up the word probability vectors:


$$s_k(d) = \sum_{i=1}^{N} p_k(w_i) \log(1 + f(w_i, d)), \quad k = 1, 2, \ldots, K.$$


In the above formula, $s_k(d)$ is the “probability sum” of the
document d on the k-th topic, $p_k(w_i)$ is the probability of the
word $w_i$ on the k-th topic, and $f(w_i, d)$ is the frequency of the
word $w_i$ in the document d. The logarithm of word frequency is known
as the Zipf scale [9].
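
The probability sum can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: it assumes the frequency weight has the form log(1 + f(w, d)) with the natural logarithm, and the two-topic model and the example document are hypothetical.

```python
import math
from collections import Counter


def probability_sum(tokens, topic_word_probs):
    """Return [s_1(d), ..., s_K(d)] for one tokenized document.

    topic_word_probs[k] maps each vocabulary word to p_k(w). Words that do
    not occur in the document contribute log(1 + 0) = 0, so only the words
    observed in the document need to be visited.
    """
    freq = Counter(tokens)  # f(w, d)
    return [
        sum(p_k.get(w, 0.0) * math.log(1 + f) for w, f in freq.items())
        for p_k in topic_word_probs
    ]


# Hypothetical two-topic model over a four-word vocabulary.
topics = [
    {"flood": 0.5, "river": 0.3, "run": 0.1, "shoe": 0.1},
    {"flood": 0.1, "river": 0.1, "run": 0.5, "shoe": 0.3},
]
doc = ["flood", "flood", "river", "run"]
s = probability_sum(doc, topics)
```

Here the first component of `s` is larger than the second, because the document is dominated by words that have high probability in the first topic.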


3. Corpus for Document Clustering 

In total, 201 participants wrote 1481 summaries for 8 passages, about
185 for each passage [10]. The lengths of the passages ranged from 195
to 399 words. The Flesch-Kincaid grade level ranged from 8.6 to 11.7.
Some passages had similar topics: Working and Running, Kobe and
Jordan, and Effects of Exercising were about sports and exercising,
and Floods and Hurricane were about disasters.


The summaries were collected in an online experiment. The
original goal was to evaluate the effect of an online AutoTutor [5,
9] lesson that teaches summarization. Each participant composed
summaries for 2 texts before learning the lesson, 2 after learning,
and 4 during learning, in a counter-balanced design. Participants
wrote each summary immediately after reading a passage. The system
automatically controlled summary length (50-100 words) and
plagiarism: a summary could not be submitted when its length was out
of range or when it contained 10 consecutive words copied from the
original passage.


Each summary was treated as a document for topic modeling. The
vocabulary size was 4275 after removing stop words. Six topic
models were built, with 4, 8, 12, 16, 20, and 24 topics,
respectively. For each model, the topic proportions and the
probability sums were computed for each summary. The LDA
package used for topic modeling was Infer.NET from Microsoft [8].


Topic proportions and probability sums were then used as
document features for clustering. We used the K-means clustering
method and fixed the number of clusters at 8 for all six topic models.
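
For readers unfamiliar with K-means, the clustering step can be sketched as follows. The implementation and initialization used in the study are not stated, so this is only an illustrative version of Lloyd's algorithm with an assumed deterministic initialization; the feature vectors are hypothetical stand-ins for topic proportions or probability sums.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    n = len(points)
    # Assumption: initialize with k evenly spaced points from the input.
    centroids = [points[i * n // k] for i in range(k)]
    labels = [-1] * n
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


# Hypothetical 2-dimensional document features forming two clear groups.
features = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.1],
            [0.1, 0.9], [0.2, 0.8], [0.15, 0.9]]
labels = kmeans(features, k=2)
```

In the setting described above, `points` would hold one K-dimensional feature vector per summary and `k` would be 8.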


4. Results 


We define the similarity of two clustering results by 


$$\mathrm{Sim} = \frac{\sum_{i} (\text{number of shared documents in cluster pair } i)}{\text{total number of documents}}$$

The cluster pairs were arranged using the Hungarian algorithm [7]
so that the similarity is the highest under the pairing. For each of
the two document representations, we first compared the cluster
similarity between models with 4 and 8 topics, 8 and 12, 12 and 16,
16 and 20, and 20 and 24. We aimed to check whether the clusters
converge as the number of topics increases.


The results showed that as the number of topics increased,
clustering based on probability sum quickly converged: the
similarity between the 12-topic and 16-topic models was 0.96. For
topic-proportion-based clustering, the similarity between 8 and 12
topics came close to that of probability sum. However, it dropped
between 12 and 16 topics and then rose to 0.81 between 20 and 24.


While both representations converged to some clusters, the
topic-proportion-based clustering converged to unevenly distributed
clusters. The largest two clusters contained 908 documents out of
1480. In contrast, probability-sum-based clustering converged to
clusters of almost the same sizes as the original summary groups.


Table 1 shows the best matched clusters to the original passages for
the 24-topic model. Topic-proportion-based clusters match the
original passage groups with a similarity of 0.60, whereas
probability-sum-based clustering did surprisingly better: its
cluster similarity to the original summary grouping was 0.98.


Table 1 Best matched clusters to original passages


Passage    1    2    3    4    5    6    7    8

Topic Proportion Based Clusters
BM       160    0    0    0    0   20    1    2
Di         6    5  101    1    0   69    0    0
EE         0    1  186    0    1    1    0    0
Fl        11    7   21    1    1  139    5    1
Hu         1    0    1    1  173    3    5    0
JM         0    0    1    0    0  179    0    1
KJ         0    0    0    0    1    1  185    1
WR         1    0  164    0    1   20    0    1

Probability Sum Based Clusters
BM       180    0    1    0    1    1    0
Di         0  176    0    0    0    6    0    0
EE         0    1  182    0    0    5    1    0
Fl         0    0    0  179    1    6    0    0
Hu         0    0    0    0  180    4    0    0
JM         0    1    0    0    0  179    1    0
KJ         0    0    0    0    1    1  186    0
WR         0    0    2    0    0    4    0  181


Note: BM=Butterfly and Moth, Di=Diabetes, EE=Effects of
Exercising, Fl=Floods, Hu=Hurricane, JM=Job Market, KJ=Kobe
and Jordan, and WR=Working and Running.
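
As a check on the similarity measure, the 0.60 reported for the topic-proportion-based clusters can be recomputed from the counts in Table 1. The sketch below is not the original analysis code: for brevity it replaces the Hungarian algorithm with a brute-force search over all 8! = 40,320 one-to-one pairings, which finds the same optimum on a table this small. The function name is chosen for illustration only.

```python
from itertools import permutations

# Rows: passages BM, Di, EE, Fl, Hu, JM, KJ, WR; columns: clusters 1-8.
# Counts copied from the topic-proportion-based part of Table 1.
counts = [
    [160, 0, 0, 0, 0, 20, 1, 2],
    [6, 5, 101, 1, 0, 69, 0, 0],
    [0, 1, 186, 0, 1, 1, 0, 0],
    [11, 7, 21, 1, 1, 139, 5, 1],
    [1, 0, 1, 1, 173, 3, 5, 0],
    [0, 0, 1, 0, 0, 179, 0, 1],
    [0, 0, 0, 0, 1, 1, 185, 1],
    [1, 0, 164, 0, 1, 20, 0, 1],
]


def best_pairing(table):
    """Return (shared, total) under the best one-to-one row/column pairing.

    shared is the largest number of documents that any pairing of rows
    with columns can place in matched cells; total is the table sum.
    """
    n = len(table)
    total = sum(map(sum, table))
    shared = max(
        sum(table[row][col] for row, col in enumerate(perm))
        for perm in permutations(range(n))
    )
    return shared, total


shared, total = best_pairing(counts)
sim = shared / total
```

With these counts the best pairing shares 892 of 1480 documents, which rounds to the reported similarity of 0.60. Because the pairing is one-to-one, only one of the three passages that fall mostly into cluster 3 can be matched to it, which is why the similarity is far below the sum of the row maxima.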


The cluster similarity changed as the number of topics in the topic
model increased. The topic-proportion-based clustering reached its
highest cluster similarity to the original grouping, 0.77, when the
number of topics was 12. It then dropped below 0.60. The
probability-sum-based clustering had higher similarities than topic
proportion for all models and consistently converged toward 1.